Introduction to Web Scraping¶



ONS / NISR
2021

What is web scraping¶

Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique used to automatically extract large amounts of data from websites and save it to a file or database.

The Internet is a store of the world's information, be it text, media or data in any other format. Every web page displays data in one form or another. Access to this data is crucial for the success of most businesses in the modern world. Unfortunately, most of this data is not open: most websites do not provide an option to save the data they display to your local storage, or to reuse it on your own website.

Why do web scraping¶

Web scraping is used for getting data. Data collection and analysis are important not just for businesses but also for government, non-profit and educational institutions.

The following are a few of the many common applications of web scraping:

  1. In eCommerce, Web Scraping is used for competition price monitoring.

  2. In Marketing, Web Scraping is used for lead generation, to build phone and email lists for cold outreach.

  3. In Real Estate, Web Scraping is used to get property and agent/owner details.

  4. Web Scraping is used to collect training and testing data for Machine Learning projects.

Is Web Scraping Legal?¶

One of the first questions that comes to mind once you have decided to scrape data is whether the process of web scraping is legal. Scraping data that is already in the public domain is generally legal, as long as you use the data ethically.

Additional considerations¶

Whilst the process of web scraping is generally legal, consideration should be given to the data that you're attempting to collect. Even if it is publicly accessible, you may not have a legal basis to collect personal or copyrighted data.

Personal Data - As a rule of thumb, you should have a lawful basis for obtaining, storing and using personal data, especially where you do not have the user’s consent.

Copyrighted Data - Scraping copyrighted data is not in itself illegal, so long as you don’t plan to reuse or republish it.

What are web pages?¶

HTML (Hypertext Markup Language)¶

The backbone of any web page is HTML. This is a relatively simple markup language that uses <tags>, denoted by angle brackets, to mark up different elements.

Open https://www.statistics.gov.rw in any web browser, then right-click on the page and select View Source.

Creating a Basic HTML page¶

As HTML is just a series of <tags> written in plain text, we can create a web page that can be rendered in any browser just using a text editor.

Create a new file called my_webpage.html and add the following text.

<html> <!-- Open the HTML tag to declare that everything inside is HTML -->
    <body> <!-- Open the body tag, this is where we can write visible elements -->
        <h1>Page title</h1> <!-- h1 stands for Heading, see the use of </> to close the tag -->
        <p>This is my webpage.</p> <!-- p stands for paragraph -->
    </body> <!-- Close the body tag -->
</html> <!-- Close the HTML tag-->

There are plenty of other <tags> we can use in HTML; a full list can be found here

Some common ones you'll see are listed below

  • <div>: Used to group elements together, or to provide structure to the web page
  • <span>: Used to group inline elements; similar to <div> but doesn't start on a new line
  • <img>: Adds an image to the web page
  • <table>, <th>, <tr>, <td>: Defines a table in HTML, with the sub-elements defining the table header, table rows and table cells respectively
  • <a>: Creates a hyperlink around a specific element
  • <b>, <i>: Create bold and italic elements respectively
  • <ol>, <ul>, <li>: Create ordered and unordered lists, where <li> tags mark the list items

Let's create a second web page called my_complex_webpage.html that incorporates some of these other HTML elements.

<html>
    <body>
        <h1>My Complex Webpage</h1>
        <p>This is my more complex webpage with additional elements</p>
        <a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a>
        <p>Below here is the NISR logo</p>
        <img src="https://www.statistics.gov.rw/sites/default/files/images/logo.png">
        <h2>This is an unordered list of fruits</h2>
        <ul>
            <li>Apple</li>
            <li>Banana</li>
        </ul>
        <h2>This is a HTML table</h2>
        <table>
            <tr><th>Column 1</th><th>Column2</th><th>Column3</th></tr>
            <tr><td>1</td><td>2</td><td>3</td></tr>
            <tr><td>4</td><td>5</td><td>6</td></tr>
            <tr><td>7</td><td>8</td><td>9</td></tr>
        </table>
    </body>
</html>

Cascading Style Sheets (CSS)¶

HTML is good for structure, but it isn't very useful for styling the elements on a web page. That's where Cascading Style Sheets (CSS) come in. CSS is a separate language that allows us to apply "styles" to elements on our HTML web page.

For example, if we wanted to set the background of our web page to black and the font colour to white, we could use the following CSS code.

/* The body tells the browser to only apply the contained styles onto the <body> element */

body {  
    background: black; /* Set the page background to black */
    color: white; /* Set the page font colour to white */
}

Save the above code as style.css

There are two ways to add CSS to our web page. We can add it directly into the HTML document using <style> tags. More commonly, you'll see CSS stored in a separate .css file which is linked from the .html file using the <head> and <link> tags.

The <head> tag is like the <body> tag, but is used to store additional meta information that isn't directly displayed on the page.

<html>
    <head>
        <link rel='stylesheet' href='style.css'>
    </head>
    <body>
        ...
    </body>
</html>

Create a copy of my_complex_webpage.html called my_complex_webpage_with_css.html and add the <head> and <link> tags as described above.

CSS is able to define styles not just for types of elements (i.e. <body>, <li>, <p>); it can also define classes that can be applied to any number of elements.

/* The "." at the start of the definition tells the browser to apply this style to any elements */
/* that have the specified class name. */

.red_text {
    color: red;
}

Add this style to your style.css file.

We can now use the class attribute on any HTML element to assign this style to specific elements.

<html>
    <head>
        <link rel='stylesheet' href='style.css'>
    </head>
    <body>
        <h1>My Complex Webpage</h1>
        <p class="red_text">This is my more complex webpage with additional elements</p>
        ...
        <h2 class="red_text">This is an unordered list of fruits</h2>
        <ul>
            <li class="red_text">Apple</li>
            <li>Banana</li>
        </ul>
        ...
    </body>
</html>

Edit your copy, my_complex_webpage_with_css.html, to include the class attribute on some tags.

Congrats, you're officially a web designer!¶

Scraping web pages with Pandas¶

Pandas has a built-in function called read_html that allows us to read HTML tables directly from a web page. We can try this with the web page we just finished creating, using the following code.

import pandas as pd 
df = pd.read_html('./my_complex_webpage.html')
df
   Column 1  Column2  Column3
0         1        2        3
1         4        5        6
2         7        8        9

Pandas correctly found our table, parsing out all the other HTML. Note that, by default, read_html returns a list of all the tables pandas can find on the web page, even if it only finds one.
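We can see this list behaviour in isolation with an inline HTML string (a minimal sketch; wrapping the string in io.StringIO keeps newer versions of pandas happy, and an HTML parser backend such as lxml is assumed to be installed):

```python
from io import StringIO

import pandas as pd

# A document containing exactly one <table>.
html = """
<table>
  <tr><th>Column 1</th><th>Column2</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""

# read_html always returns a *list* of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))
```

Even with a single table on the page, we still have to index into the list (e.g. `tables[0]`) to get the DataFrame itself.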

Pandas is also able to filter out any of the CSS that's been applied to our tables, returning only the data.

import pandas as pd

# Select the first / only dataframe in the list
df_no_css = pd.read_html('./my_complex_webpage.html')[0]
df_css = pd.read_html('./my_complex_webpage_with_css.html')[0]

# This will error if the dataframes aren't identical.
pd.testing.assert_frame_equal(df_no_css, df_css)

Real world application¶

Obviously, real-world websites are much messier than our example page, so we will also need to employ some basic data cleaning techniques to deal with them.

Let's look at the Wikipedia page for the Rwandan Men's National Basketball Team. There are lots of different tables in different styles: some with images, some with complex headers. We can throw the URL directly into read_html and see what comes out.

import pandas as pd 

basketball_tables = pd.read_html('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')

print(f'Tables found: {len(basketball_tables)}')
Tables found: 13

Often web developers will use <table> tags as a structural element rather than to explicitly display data. Note that index 0 in basketball_tables doesn't refer to the first visible table, but to the information card at the top of the page.

Looking through the 13 parsed tables, we can find the current roster table at position 4. But as Wikipedia pages can change, we want to write code that always selects the roster table. We can do that using the match keyword argument, which makes read_html return only the tables containing the string passed.

Once we've done that, we can add our usual skiprows and header arguments to make sure the correct row is used as the header of the table.

url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2)[0]
roster.head()
Pos. No. Name Age – Date of birth Height Club Ctr.
0 PG 4 Jean Nshobozwabyose 23 – (1998-06-26)26 June 1998 1.83 m (6 ft 0 in) Patriots NaN
1 G 5 Ntore Habimana 24 – (1997-08-15)15 August 1997 1.96 m (6 ft 5 in) Wilfrid Laurier Golden Hawks NaN
2 SG 6 Steven Hagumintwari 27 – (1993-10-01)1 October 1993 1.93 m (6 ft 4 in) Patriots NaN
3 SG 7 Armel Sangwe 24 – (1997-04-15)15 April 1997 1.90 m (6 ft 3 in) Espoir NaN
4 SG 8 Emile Kazeneza 20 – (2000-08-30)30 August 2000 2.01 m (6 ft 7 in) William Carey University NaN

We now have code that can scrape that table whenever we want. However, something looks a little wrong with the Age – Date of birth column: not all of the data has been scraped, notably the actual dates of birth.

This is because there is hidden data within these cells. Pandas will stop scraping a cell when it hits hidden data unless we explicitly tell it not to with displayed_only=False.

url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'

roster = pd.read_html(url, 
                      match="Rwanda men's national basketball team roster", 
                      skiprows=1, 
                      header=2,
                      displayed_only=False)[0]
roster.head()
Pos. No. Name Age – Date of birth Height Club Ctr.
0 PG 4 Jean Nshobozwabyose 23 – (1998-06-26)26 June 1998 1.83 m (6 ft 0 in) Patriots NaN
1 G 5 Ntore Habimana 24 – (1997-08-15)15 August 1997 1.96 m (6 ft 5 in) Wilfrid Laurier Golden Hawks NaN
2 SG 6 Steven Hagumintwari 27 – (1993-10-01)1 October 1993 1.93 m (6 ft 4 in) Patriots NaN
3 SG 7 Armel Sangwe 24 – (1997-04-15)15 April 1997 1.90 m (6 ft 3 in) Espoir NaN
4 SG 8 Emile Kazeneza 20 – (2000-08-30)30 August 2000 2.01 m (6 ft 7 in) William Carey University NaN

There we go, now we've got all the data we want from the table. Unfortunately, as Wikipedia uses images rather than text to represent the players' countries, we're unable to scrape them using pandas.

We'll look at other methods to get this data later.

Limitations of Pandas for web scraping¶

Obviously we've seen some of the limitations already, notably that pandas isn't able to parse images and that it collects tables that aren't relevant to our intended goal. More importantly, in most web pages the data we want to scrape won't be formatted into a nice table for us. If it isn't in <table> tags then we won't be able to scrape it using pandas.

  • Good for websites with predefined tables
  • Won't collect information that isn't text
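To illustrate the second point, here is a minimal sketch: when a page holds its data in a list rather than a table, read_html finds nothing to parse and raises a ValueError.

```python
from io import StringIO

import pandas as pd

# The data is plainly visible, but it lives in a <ul>, not a <table>.
html = "<html><body><ul><li>Apple</li><li>Banana</li></ul></body></html>"

no_tables_found = False
try:
    pd.read_html(StringIO(html))
except ValueError:
    # pandas raises ValueError("No tables found")
    no_tables_found = True

print(no_tables_found)
```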

There are lots of other methods for accessing that data, but first we need to understand a little about how websites function.

How do web sites work?¶

Now that we understand the structure of a web page, we can see how it might be extremely tedious to create every individual web page, especially if we want to include regularly changing data.

That's why most web pages are created dynamically. This means that the web page is put together on-the-fly whenever someone requests to see it.

Client-side versus Server-side scripting¶

Web pages are usually generated in one of two ways: via client-side scripting or via server-side scripting. This defines where the data gets turned into HTML elements. If it happens client-side, the raw data is sent to our browser and our computer builds the web page; if it happens server-side, we never see the raw data, only the computed HTML elements.

Client-side scripting:
  • Data is usually processed with JavaScript
  • It is possible to see the underlying data

Server-side scripting:
  • Data can be processed with PHP, JavaScript, Python, etc.
  • It is not possible to see the underlying data

Inspecting a web page's creation¶

We've already looked at a web page's source using View page source. There is a more advanced tool for working with web pages built into most browsers, usually called Inspect (Right-Click > Inspect). Let's inspect the Wikipedia page for the Rwandan Men's National Basketball Team.

We'll come back to the Elements panel later; for now we want to look at the Network tab on the toolbar.

The network tab records all the requests that go between our browser and the server (as well as other servers) in the production of the web page. When you first open the page it will be blank. Refreshing the browser page will cause the network tab to record all the different requests that occur.

Clicking on any one of the requested files shows the full HTTP request (more on this later), as well as a preview and the full response from the server for that request. Looking at the response for the first request (the page itself), we can see that the data was included directly in the page as HTML. This implies that this particular page was processed server-side. Another clue is the reference to PHP, which is an exclusively server-side language.

Client Side Example¶

Let's look at the NBA website instead. This page shows us statistics for the regular season for players in the NBA, ordered by the number of points.

We could try to scrape this data using pandas, but let's see if we can find the source of the data first. Opening the Inspect tool, we can look at the Network tab to try to find where this data is loaded from.

A lot of files are loaded as part of this web page. We can reduce the number we need to search through by using the built-in filters on the Network tab. Let's look at Fetch/XHR, which filters the list down to the requests usually associated with data.

Looking through this shorter list of files, one stands out as potentially containing the data that we want to extract from the web page.

We can click on the file and then the Response tab to see what information this request sends to our browser. Looking at the response, we can see that the data that goes into our table is not encoded as HTML, so we can be relatively sure that the web page is generated at least partly on the client side.

Key takeaways¶

  • If a website is processed client-side then it may be possible to get at the data that creates the web page without having to parse HTML
  • However, some web scraping programs won't be able to execute the client-side code, meaning we have to use a web browser.
  • If a website is processed server-side then it is not possible to get the data without parsing the HTML served.

Requests Library¶

The requests library is the de facto standard for making HTTP requests in Python; it abstracts away all of the complexity we just saw using the Inspect tool. requests is not part of the standard library, but it comes preinstalled with distributions such as Anaconda; otherwise, install it with pip install requests.

The requests library is very powerful; importantly, we can use it to do in Python what our web browser was doing when it loaded in our data.

Returning to our NBA example, the Network tab shows us all of the HTTP requests that have been made in the process of creating the web page that we see.

If we look in the Headers tab, we can see the form that this HTTP request took.

The URL has the request information encoded into it; we can also see that the request type is GET.

Let's see what happens if we recreate that request in Python using the requests library. First we need to copy the request URL from the Headers tab, noting that the method is GET.

import requests

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

# We are using the .get method to match the GET HTTP request.
# We also chain the .json() method to convert the response
# body into a python dictionary.
response = requests.get(url).json()

print(response)

We can see that the result of that command is the same as the data we saw using the Inspect tool. We can look through this nested dictionary to understand the structure of the response. Note that not every response will look the same: you'll need to dig into each response to work out how to extract the data.

We can look through the object and see if there is a way to convert this information into a table that we can use.

print(response.keys())
print(response['resultSet'].keys())
dict_keys(['resource', 'parameters', 'resultSet'])
dict_keys(['name', 'headers', 'rowSet'])

Looking at the keys in the data, we can see that the response contains three objects called resource, parameters and resultSet. resource and parameters are metadata about the table that we've just requested. resultSet contains another dictionary with the keys name, headers and rowSet. rowSet is a list of lists, each representing a row of data, and headers contains a list of column headers.

We can put these together into a dataframe very easily using pandas.

import requests
import pandas as pd 

url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'

response = requests.get(url).json()
table_headers = response['resultSet']['headers']
table_data = response['resultSet']['rowSet']

df = pd.DataFrame(table_data, columns=table_headers)
df
PLAYER_ID RANK PLAYER TEAM GP MIN FGM FGA FG_PCT FG3M ... FT_PCT OREB DREB REB AST STL BLK TOV PTS EFF
0 201142 1 Kevin Durant BKN 10 34.9 11.0 19.2 0.573 1.8 ... 0.800 0.5 8.1 8.6 5.4 0.7 0.8 3.5 28.6 31.2
1 203507 2 Giannis Antetokounmpo MIL 10 32.1 9.7 18.9 0.513 1.4 ... 0.707 2.0 9.1 11.1 6.0 1.2 1.7 3.3 27.3 32.1
2 202331 3 Paul George LAC 9 35.2 9.9 21.3 0.464 3.2 ... 0.878 0.3 7.9 8.2 5.2 2.7 0.6 4.8 27.0 26.9
3 201942 4 DeMar DeRozan CHI 9 34.9 9.6 19.0 0.503 0.9 ... 0.871 0.9 5.0 5.9 3.7 1.0 0.4 1.6 26.8 25.8
4 203897 5 Zach LaVine CHI 9 35.4 9.2 19.6 0.472 2.3 ... 0.895 0.6 5.2 5.8 4.2 0.4 0.3 2.8 26.4 23.4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
268 203085 269 Austin Rivers DEN 7 11.6 0.7 2.7 0.263 0.3 ... 0.500 0.1 0.9 1.0 0.7 0.3 0.0 0.9 1.9 0.9
269 1626161 270 Willie Cauley-Stein DAL 9 11.1 0.8 1.9 0.412 0.0 ... 0.000 0.9 1.7 2.6 0.6 0.3 0.0 0.2 1.6 3.7
270 1630215 271 Jared Butler UTA 8 5.5 0.5 2.1 0.235 0.3 ... 0.500 0.0 0.8 0.8 0.8 0.0 0.6 0.8 1.5 1.0
271 1628975 272 Jevon Carter BKN 10 14.3 0.5 2.8 0.179 0.4 ... 0.000 0.2 1.6 1.8 1.1 0.5 0.4 0.4 1.4 2.5
272 1629216 273 Gabe Vincent MIA 7 7.1 0.6 1.3 0.444 0.1 ... 0.000 0.4 0.6 1.0 1.6 0.0 0.0 0.6 1.3 2.6

273 rows × 24 columns

If we look closer at the URL, we can see it encodes a lot of arguments, and these arguments look very similar to the filters that are available on the web page.

https://stats.nba.com/stats/leagueLeaders?
    LeagueID=00&
    PerMode=PerGame&
    Scope=S&
    Season=2021-22&
    SeasonType=Regular+Season&
    StatCategory=PTS

If we change "PerGame" to "Totals" and re-run our code, we get the data that would populate the website's table had we selected that option. What we've done here is discover the API that sits behind the NBA website, and we can exploit this to extract data.

PLAYER_ID RANK PLAYER TEAM GP MIN FGM FGA FG_PCT FG3M ... REB AST STL BLK TOV PF PTS EFF AST_TOV STL_TOV
0 201142 1 Kevin Durant BKN 11 384 123 217 0.567 21 ... 96 58 7 8 36 15 324 351 1.61 0.19
1 201939 2 Stephen Curry GSW 10 336 87 203 0.429 52 ... 66 66 16 7 31 15 276 283 2.13 0.52
2 203507 3 Giannis Antetokounmpo MIL 10 321 97 189 0.513 14 ... 111 60 12 17 33 31 273 321 1.82 0.36
3 201942 4 DeMar DeRozan CHI 10 349 96 191 0.503 9 ... 57 36 10 4 17 21 269 253 2.12 0.59
4 1628970 5 Miles Bridges CHA 12 439 96 215 0.447 32 ... 89 39 21 10 20 33 267 275 1.95 1.05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
440 201586 415 Serge Ibaka LAC 1 8 0 3 0.000 0 ... 1 1 0 1 1 5 0 -1 1.00 0.00
441 1630536 415 Sharife Cooper ATL 2 7 0 3 0.000 0 ... 0 2 0 0 1 0 0 -2 2.00 0.00
442 1629605 415 Tacko Fall CLE 2 3 0 1 0.000 0 ... 2 0 0 0 1 0 0 0 0.00 0.00
443 1630176 415 Vernon Carey Jr. CHA 1 1 0 1 0.000 0 ... 1 0 0 0 0 0 0 0 0.00 0.00
444 1627782 415 Wayne Selden NYK 1 1 0 0 0.000 0 ... 0 0 0 0 0 0 0 0 0.00 0.00

445 rows × 27 columns
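Rather than editing the query string by hand, we can also build the URL from a dictionary of parameters using the standard library (a sketch; the parameter names are copied from the URL above):

```python
from urllib.parse import urlencode

base = 'https://stats.nba.com/stats/leagueLeaders'
params = {
    'LeagueID': '00',
    'PerMode': 'Totals',          # was 'PerGame' in the original request
    'Scope': 'S',
    'Season': '2021-22',
    'SeasonType': 'Regular Season',
    'StatCategory': 'PTS',
}

# urlencode handles the ?key=value&... encoding (spaces become '+')
url = base + '?' + urlencode(params)
print(url)
```

The resulting URL can be passed to requests.get exactly as before, which makes it easy to loop over seasons or stat categories. (requests.get also accepts a params= argument that does this encoding for you.)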

Processing HTML data¶

Sometimes, in fact most of the time, the information that we want to scrape won't be found neatly formatted into a table. What we need is a way to extract the relevant information programmatically from non-table elements. Enter beautifulsoup: an HTML parsing library for Python that allows us to pull the relevant information out of a web page using a nice, easy-to-use syntax.

beautifulsoup does not come as part of the standard Python installation, so we need to pip install it. We can do this inside our Jupyter notebook using

!pip install beautifulsoup4

Or just on the command line by running the same command, without the ! at the beginning of the line.

Collecting beautifulsoup4
  Using cached beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.3-py3-none-any.whl (37 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.10.0 soupsieve-2.3

Once we've installed beautifulsoup we can start using it to parse our HTML data. Let's start by parsing the web page that we made earlier.

from bs4 import BeautifulSoup 

with open('./my_complex_webpage_with_css.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup)
<html>
<head>
<link href="style.css" rel="stylesheet"/>
</head>
<body>
<h1>My Complex Webpage</h1>
<p class="red_text">This is my more complex webpage with additional elements</p>
<a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a>
<p>Below here is the NISR lo
...

beautifulsoup has lots of functions that make it very easy to extract information from an HTML page, the most useful of which is the find_all() method. Full documentation for the find_all method can be found here.

Earlier we were able to use pandas to extract the HTML table very easily, but what if we were more interested in the unordered list of fruits? We can use the find_all function to retrieve all of the list item <li> tags.

soup.find_all('li')
[<li class="red_text">Apple</li>, <li>Banana</li>]

We've successfully extracted all of the <li> tags, but our data still isn't very clean: we're not interested in the HTML tags themselves, just the data contained within. We can deal with this by using beautifulsoup to strip out the HTML tags.

# We can do this with a loop
for tag in soup.find_all('li'):
    print(tag.get_text())

# Or by using a list comprehension
[tag.get_text() for tag in soup.find_all('li')]
Apple
Banana
['Apple', 'Banana']

Success! However, it is common that not all the information we want to extract shares the same <tag>, or that lots of irrelevant information has the same <tag>. Fortunately, when people design web pages they tend to give similar information the same visual appearance. We know that visual appearance is controlled by CSS, and using beautifulsoup we can extract data by CSS class!

for red_text in soup.find_all(class_="red_text"):
    print(red_text)

[red_text.get_text() for red_text in soup.find_all(class_='red_text')]
<p class="red_text">This is my more complex webpage with additional elements</p>
<h2 class="red_text">This is an unordered list of fruits</h2>
<li class="red_text">Apple</li>
['This is my more complex webpage with additional elements',
 'This is an unordered list of fruits',
 'Apple']

Let's go back to our Wikipedia example. Remember that we were able to use pandas to scrape the roster table, but weren't able to get the country information because it wasn't stored as plain text. We can use beautifulsoup to parse out that information with much finer control.

First we need to get the HTML that generates that Wikipedia page. We can do this using our trusty requests library.

import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL)
print(wiki_page)
<Response [200]>

Oh, this is just a response code, not the HTML we were expecting. Fortunately, Response [200] means the request executed successfully. In order to get the HTML we need to use the .text attribute.

import requests

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL).text
print(wiki_page)
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Rwanda men's national basketball team - Wikipedia</title>
<script>document.documentElement.classNam
...

Now we can parse this HTML with beautifulsoup. As we're only interested in the roster table, we can tell beautifulsoup to filter out all the HTML that isn't related to it.
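The code that produces the output below is a sketch along these lines; note that picking the roster out by its sortable class is an assumption based on the class attribute visible in the printed HTML, and that find_all('table') may count nested tables differently from pandas.

```python
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL).text

soup = BeautifulSoup(wiki_page, 'html.parser')

# Count every <table> element on the page.
tables = soup.find_all('table')
print(f'Found {len(tables)} tables.')

# Grab the first table with the "sortable" class (an assumption based on
# the class attribute visible in the printed output below).
roster_html = soup.find('table', class_='sortable')
print(roster_html)
```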

Found 13 tables.

<table class="sortable" style="background:transparent; margin:0px; width:100%;">
<tbody><tr>
<th><abbr title="Position(s)">Pos.</abbr></th>
<th><abbr title="Number">No.</abbr></th>
<th>Name</th>
<th>Age – <small>Date of birth</small></th>
<th>Height</th>
<th>Club</th>
<th><abbr title="Country">Ctr.<
...
# Use a list comprehension to look for all the <th> tags, for each 
# one, get the text and strip the result. These are the column headers
# for the table.
header = [col.get_text().strip() for col in roster_html.find_all('th')]

# Create an empty list to store our processed rows.
rows = []
# Loop over all of the <tr> tags; each one corresponds to a row
# in our table.
for tr in roster_html.find_all('tr')[1:]:
    # Create an empty row variable where we can store all of our processed
    # data
    row = []
    # Loop over all of the <td> tags inside the current <tr> tag. These are 
    # going to be our data items.
    for data in tr.find_all('td'):

        # If the data item isn't blank (or just a new line character)
        # then add it to our row, stripping out the excess whitespace
        if data.get_text() != '\n':
            row.append(data.get_text().strip())

        # If there is an <img> tag in the <td> tag then we're on our 
        # flag column. We want to extract the country information. 
        # We could extract this from the image, but all the images are 
        # wrapped in a <a> hyperlink tag to that country, which will be
        # easier to clean. 
        if data.find('img') is not None:
            # Get the <a> hyperlink tag
            img = data.find('a')
            # Add the href attribute (this is the link address) to our row
            row.append(img['href'])

    # Finally add the row into our list of rows.
    rows.append(row)

# Construct a dataframe from our list of rows and our header data
df = pd.DataFrame(rows, columns=header)
df
Pos. No. Name Age – Date of birth Height Club Ctr.
0 PG 4 Jean Nshobozwabyose 23 – (1998-06-26)26 June 1998 1.83 m (6 ft 0 in) Patriots /wiki/Rwanda
1 G 5 Ntore Habimana 24 – (1997-08-15)15 August 1997 1.96 m (6 ft 5 in) Wilfrid Laurier Golden Hawks /wiki/Canada
2 SG 6 Steven Hagumintwari 27 – (1993-10-01)1 October 1993 1.93 m (6 ft 4 in) Patriots /wiki/Rwanda
3 SG 7 Armel Sangwe 24 – (1997-04-15)15 April 1997 1.90 m (6 ft 3 in) Espoir /wiki/Rwanda
4 SG 8 Emile Kazeneza 20 – (2000-08-30)30 August 2000 2.01 m (6 ft 7 in) William Carey University /wiki/United_States
5 SG 9 Dieudonné Ndizeye 24 – (1996-10-14)14 October 1996 1.98 m (6 ft 6 in) Patriots /wiki/Rwanda
6 PF 10 Olivier Shyaka 26 – (1995-08-14)14 August 1995 2.00 m (6 ft 7 in) REG /wiki/Rwanda
7 F 11 Alex Mpoyo 24 – (1997-01-05)5 January 1997 2.01 m (6 ft 7 in) Trepça /wiki/Kosovo
8 SG 12 Kenny Gasana 36 – (1984-11-09)9 November 1984 1.90 m (6 ft 3 in) Patriots /wiki/Rwanda
9 C 13 Elie Kaje 26 – (1995-03-17)17 March 1995 1.90 m (6 ft 3 in) Patriots /wiki/Rwanda
10 C 16 Prince Ibeh 27 – (1994-06-03)3 June 1994 2.06 m (6 ft 9 in) Patriots /wiki/Rwanda
11 SF 17 William Robeyns 25 – (1996-02-23)23 February 1996 1.91 m (6 ft 3 in) Phoenix Brussels /wiki/Belgium
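As a taste of the basic data cleaning mentioned earlier, the /wiki/ hrefs in the Ctr. column can be turned into readable country names using pandas string methods (a sketch on sample values; replacing underscores with spaces is an assumption about multi-word names such as United_States):

```python
import pandas as pd

# Sample values matching the Ctr. column scraped above.
df = pd.DataFrame({'Ctr.': ['/wiki/Rwanda', '/wiki/United_States', '/wiki/Kosovo']})

# Strip the '/wiki/' prefix and turn underscores back into spaces.
df['Ctr.'] = (df['Ctr.']
              .str.replace('/wiki/', '', regex=False)
              .str.replace('_', ' ', regex=False))
print(df['Ctr.'].tolist())
```

The same two-step replace would be applied to the full scraped dataframe in place of the sample one.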

Selenium Library¶

Scrapy Library¶